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Abstract 

Parameter estimates for associated genetic variants, reported in the discov- 
ery samples are often grossly inflated compared to the values observed in the 
follow-up samples. This type of bias is a consequence of the sequential pro- 
cedure as a declared associated variant must first pass a stringent significance 
threshold. We propose a hierarchical Bayes method in which a spike-and-slab 
prior is used to account for the possibility that the significant test result may 
be due to chance. We investigate the robustness of the method using different 
priors corresponding to different degrees of confidence in the testing results and 
propose a Bayesian model averaging procedure to combine estimates produced 
by different models. The Bayesian estimators yield smaller variance compared 
to the conditional likelihood estimator and outperform the latter in studies with 
low power. We investigate the performance of the method with simulations and 
illustrate it using four real data examples. 

Keywords: Bayesian Model Averaging, Hierarchical Bayes Model, Spike-and-Slab 
Prior, Winner's Curse. 



1 Introduction 



Parameter estimates (e.g. odds ratio) for associated genetic variants (e.g. Single- 
Nucleotide Polymorphisms), reported in the discovery samples a re often gr o ssly i n- 
flated compared to the values observed in the follow-up samples ( INair et al.l (120091 )). 
This type of bias is a consequence of model selection, because a declared associated 
variant must pass a stringent significance threshold as well as be the winne r among a ll 
competing variants. T his phenomenon is also known as the Beavis effect ( jXul (120031 )) 
or the Winner's curse (jZollner and Pritchardl (120071 )) in the biostatistics literature. 

The Winner's curse has recently gained much attention in genetic studies, be- 
cause it's been recognized as one of the ma jor contributing facto rs to the failures 
of many attempted replication studies (e.g. 



Ioannidis et al. 



d2009|)). For example, 



five Nature Genetic publications in the first three months of 2009 acknowledged the 
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effect of Winner's curse (e.g., iNair et al.l . 120091 ) . In their recent Nature Review pa- 
per entitled "Validati ng, augmenting and refining genome-wide association signals" , 



Ioannidis et al 



(120091 ) dedicated a section to the Winner's curse and emphasized that 
11 the magnitude of the winner's curse is inversely related to the power of the study. In 
typical circumstances, for 10% power, the inflation of an additive effect could be ap- 
proximately 60%... For small effects [anticipated for susceptibility loci associated with 
complex diseases/traits], even large meta-analyses could be grossly under-powered and 
emerging associations could be considerably inflated. For rare variants, the power can 
be < 

Some authors (e.g., iGoring et all l200ll ) have argued that reliable parameter es- 



timates can only be obtained from an independent sample. However, collecting ad- 
ditional samples could be undesirable in some cases due to, for example, time and 
budget constraints as well as concerns over population heterogeneity and sampling dif- 
ferences. Two categories of methods were subsequently proposed to correct for the se- 
lect ionbjijsjj^inj^ only: the mode l -free r esampling based methods 



.e.g., 



Sun and Bulll 



meth ods (e.g. 



2005 



Wu et al 



Zollner and Pritchard 



2006 



Yu et al 



2007 



Ghosh et al. 



2007) and the likelihood based 



2008 



Zhong and Prentice! . 



20081 ). Both types of approaches were shown to substantially reduce the estimation 



bias in relatively modest sample size s, and comparable performances in terms of ac 



curacy and efficiency were observed (IFaye et al. 



20081 ). However, one caveat is that 



the variances of the proposed estimators in both categories are considerably higher 
than the original naive estimator, rendering highly variable estimates of sample size 
for replication studies, e yen if the Root Mean Squar ed Errors (RMSE) are lower. 
For example, Figure 4 of IZollner and Pritchard! (120071 ) shows that the bias-adjusted 
sample size estimates range from ~ 500 to ~ 100, 000 compared to the actual re- 
quired sample size of 1,261 for a successful replication study (a = 10~ 6 , 1 — (3 = 80%). 
The increased variance of the bias-reduced estimators via either the resampling or 
likelihood methods is a consequence of the "double use" of the same data for both 
variant detection and parameter estimation. The original sample size n can be decep- 
tively large for parameter estimation if the same samples were first used for selecting 
the most promising variant (s). The loss of effective sample size for estimation is in- 



versely proportional to the po wer o 



contamination as discussed by iMengi (119941 ) . 



variant detection. It is a form of hidden data 
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Motivated by the above observations and the fact that some form of prior infor- 
mation is often available in genetic studies, we propose here a Bayesian framework 
to further reduce the bias and decrease the variance of the estimates. In particu- 
lar, we focus on the log odds ratio estimates from genome-wide association (GWA) 
studies via logistic regression analyses of case-control disease status, because most of 
the current genetic mapping studies adopt the GWA study design. We first describe 
the statistical model in Section 2. We prove in Section 3 that there are no unbiased 
estimators for log odds ratio obtained conditionally on statistical significance. We 
present the Bayesian methodology in Section 4 with detailed discussions on the prior 
specifications and the advantages of model averaging. We assess the performance 
of the proposed methods in Section 5 via simulation studies and demonstrate their 
utility in Section 6 using four studies. Our concluding remarks are in Section 7. 



2 The Statistical Model 



Let (3 refer to the true log odds ratio (LOR), the parameter of interest, for the risk 
allele of an a ssocia ted SNP, and Z the statistic for the associate test. Following 



Ghosh et al. 



has the form 



(120081 ). we assume that Z is asymptotically normally distributed and 



SE(P) 



N 



SE0) 



where (3 is the estimate for f3 from the logistic regression in which the response variable 
is the affection status of the sample (i.e. 0=unaffected and l=affected by the disease 
of interest) and the predictor is the SNP genotypes coded additively (i.e. 0, 1 or 2 
copies of the risk allele), with or without other covariates. Without loss of generality, 
we assume that the minor allele is the risk allele and the alternative of interest is 
one-sided, i.e. H : \i = vs. Hi : fi > 0. 

In a more general statistical setup, we assume that n iid samples, {X±, . . . ,X n }, 
were collected from a population with mean /x and variance a 2 . We assume that a 2 
is known because in genetic associat ion studies, a 2 dep e nds o n the allele frequency of 
a SNP and can be easily estimated (ISlager and Schaidl . 120011 ). We consider a normal 
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test for H : /j = vs. Hi : /x > based on the sample mean, 



cr/y/n 



N 



1 



The standard practice is to directly use X, from the same sample, as an estimate 
for fi unconditional on the null hypothesis being rejected, i.e. T n > c, where c is the 
critical value corresponding to type I error rate a, ignoring the fact that estimation is 
performed for samples with positive findings only. Note that, although E[X] = fi, the 
conditional mean > coj\fn\ > \x. As a result, such naive estimate is upward 

biased unless the power of the test is 100%. Th e amount of b i as is inversely propor- 
tional to the power as was fi rst demonstrat ed by lGoring et al.l (120011 ) in genome-wide 
linkage settings and later by iGarnerl (120071) for genome-wid e association studies. The 



likelihood based methods proposed by iGhosh et al.l (120081 ) and other authors essen- 
tially correct for this selection bias by calculating the maximum likelihood estimate 
(MLE) from the correct conditional likelihood. In this setting, 



f(T n \T n > c) 



<t>{T n 



(2.1) 



where and $ are the probability density function (pdf) and cumulative distribution 
function (cdf ) of the standard normal distribution. 



3 Lack of Unbiased Estimators for /i 



Ghosh et al 



(120081 ) and other authors have demonstrated that the MLE from the 
correct conditional likelihood could substantially reduce the bias. However, they also 
observed via simulation studies that the conditional MLE tends to over-correct for 
large \i and under-correct for small \x. Here, we prove analytically that an unbiased 
estimator for /j conditional on statistical significance does not exist. 

Since T n is a sufficient statistic for \x assuming that a 2 is known, the completeness 
of the normal family of distributions implies that if there exists no function h such 
that E^[/i(T n )] = ^y^, then no unbiased estimator of /V exists. Therefore, we can 
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restrict the search for unbiased estimators of /V to functions of TL. 

a/y/n n 

Now suppose that some function h(T n ) is an unbiased estimator of conditional 
on the statistical significance of T n , T n > c. Let g(T n ) = {T n — h{T n )}\T n > c, then 



E[g(T n )} = E[T n \T n >c}-E[h(T n )\T n >c] 



T 



cr/y/n 



1 ^oo 

~K 



d(T n ) 



ohJn 



7/ yfl 



2 + 



a \ n 



1 
1 



2; ■ e 2 + if 



/' 



r/y/n 



a U n 



C + A ■ 



l-$(c 



a I ' ypn, 



cr/y/n 



a/y/n 



where K = 1 - $ c /V 1 

Thus, we have 



a/y/n' 



c- 



a/y/n' 



v a/y/n' 



a/y/n' 



(3.1) 



Multiplying both sides by 1 — $(c — -^j=) gives 



g(T n )4>(T n r - 7 =) dT n = d)(c 



o \ n 



<J \ n 



(3.2) 



Now, let S^(y) to be a Dirac-delta function defined for y > c such that it is equal to 
for all y greater than c with the property that j^S^(y) dy = 1 for all e > 0. It is easy 
to see that a solution to equation (13.21) is g{T n ) = <5+(T n ). By the completeness of the 
normal distribution, the solution g(T n ) is unique almost everywhere on [c, 00). Thus, 
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h{T n )\T n > c = T n \T n > c is true almost everywhere on [c, oo). Hence, T n \T n > c is 

. However, T n \T n > c has an upward bias equals 



also an unbiased estimator for 



to 



<y/\/n 



l-*(c- 



. Therefore, we conclude that there are no unbiased estimators of —r- 



a/y/n 



and hence no unbiased estimators of /x 



A similar argument was used by 



Stallard et al. 



(120081 ) who showed that there is no 



conditional unbiased estimator for the effect of treatment A from a sample that was 
first used to select treatment A over B, i.e. conditioning on the fact that the sample 
effect of treatment A was larger than that of treatment B. 



4 Bayesian Bias Correction 



4.1 Prior Specification 

The possible available prior information for genome-wide association studies is diverse, 
for example, results from previous genome- wide linkage analyses and/or candidate 
studies and/or biological evidence of the SNPs. However, one common theme is 
the anticipated low power of the GWA studies and the well acknow ledged fact that 
an ap parent significantly associated SNP could be a false positive (jloannidis et all 
20091 ). Thus, the performance of the proposed Bayesian methods is assessed in this 
context, although the practical implementation of the methods could be study specific 
depending on the available prior. 

The Bayesian paradigm allows us to incorporate in our model the prior belief that 
the significance of the effect observed may be due to chance. Mathematically, this 
belief can be modeled using a "spike and slab" prior which is essentially a mixture 
between a discrete probability with mass at zero and a continuous density / with 
support on the positive real line 



p(a*) = ^{o}0*) + (i-O/0*)- 



Spike and slab priors have a long history in the Bayesia n literature on variable se 



l ection and shrinkage estimation , e.g. 



fll988lUGeorge and McCullochl (Il993h . IChipmanl (11999 ) 



Box and Mever ( 



1986). 



Mitchell and Beauchamp 



Clyde et al. 



1996|) 



Geweke 



( 1996 ). and lKuo and Mallickl (119981 ). A recent theoretical study by llshwaran and Rao 
( 120051 ) discusses similarities between Bayesian procedures using spike and slab priors 
and frequent ist procedures. 

We use Uniform(0, 2) to specify /(//), the density function for the normalized 
LOR, where the upper bound of the uniform prior was chosen as the maximum 
attained value of realistic LOR of susceptibility loci of complex diseases and traits. 
For example, the truly associate SNP in the well known major histocompatibility 
complex (MHC) region has perhaps the highest genetic effect observed to date, with 
a log odd ratio of log(5A9) = 1.7 (WTCCC, 2007). However, we should mention 
that our simulations have shown that the results remain largely the same if the prior 
support for /x is larger (e.g. fi — 6). We treat £ as a hyperparameter with a Beta 
distribution, 

= Beta(a, 6). 

The parameters a, 6 reflect our degree of prior belief in \x = (false positive) versus 
/j > (true positive). If we set a = b = 1, then p(£\a = 1,6 = 1) has the Uniform(0, 1) 
density, which implies that we do not favor, a priori, any region of (0, 1). This could 
be considered the "noninformative prior" for £. Smaller values for a and larger values 
for b, say a = 0.5 and 6 = 8, represent a higher prior confidence that the signal is real. 
Similarly, larger values for a and smaller values for 6, say a = 8 and 6 = 0.5, correspond 
to prior skepticism regarding the observed association between the significant SNP 
and the disease/trait of interest. The choice a = 2/3 and 6 = 2/3 corresponds to our 
belief in two extreme outcomes, that is £ is close to either or 1. Figure [1] shows the 
Beta distribution of £ for different setting of a and 6. 

We reparametrize the model using 9 = fi/2 for easier implementation. Therefore, 
the proposed Bayesian method has the following hierarchical structure 

p{0)=t9o(e) + (l-t)gi(6), (4.1) 

= Beta(a, 6), 

where go(9) = 5{o}(#) and g± is the density of Uniform(0, 1). 
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Figure 1: Prior density of £ for different choices of a and b. 



-M- beta(2/3,2/3) 

- H- beta(0.5,8) 

L beta(8,0.5) 

-U- beta(1,1) 




— i i i i i r~ 

0.0 0.2 0.4 0.6 0.8 1.0 
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4.2 Posterior distribution 



To obtain the posterior distribution of 9, we introduce a mixture indicator Z and let 
Z = if the significant SNP is a false positive (9 ~ go) and Z = 1 for a true positive 
(0~<7i): 

p(*|Z) = * Z = ° 

\ ^(0), ifZ = l. 

Thus, conditional on Z, the joint prior distribution for (0, £) is 



^oWHi-O 6-1 , if ^ = o 

'(i-o 6-1 , if^=i. 



Therefore, conditional on Z, the posterior distribution for the vector (6,£) can be 
expressed as: 



The posterior distribution of (9,£) is then 
n(6,£\T n ) = p(0,£\Z = 0,T n )Pr(Z = 0\T n ) +p(6,£\Z = l,T n )Pr(Z = l\T n ). (4.3) 



4.3 Sampling from the Posterior Distribution 



The characteristics of the posterior distribution cannot be studied analytically due 
to its complex form. Thus we propose to use Markov chain Monte Carlo (MCMC) 
to sample from ir. The poster ior distribution ha s a m ixture form for which the 
Data Augmentation algorithm of iTanner and Wong) (119871 ) has been proven extremely 
efficient. The algorithm relies on sampling alternatively from the distribution of 
Z\T n ,9, £ and 9,£\Z,T n . More precisely, at iteration t we carry out the following 
steps: 
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Step 1 Sample Z t G {0, 1} given and 9 t _i from the conditional distribution 



0, with probability 



Po 



t-1 



PO+Pl 



where 



Po 



1 with probability p * 



Ct-i<f>(T n ) 



1 - $ c 



(1 - &-i)0(T re - ^ 



Pi 



1 - $(c 



66> t _i > 



Step 2 i) If Z t = 0, sample 

and set // t = 8 t — 0. 
ii) If Zt — 1, sample 



& ~ Beta(a + 1,6), 



t ~p(0|T n ,Z,£)oc 



60 \ 



60 n 



£t ~ Beta(a, 6+1), 

and set /x t = 29 t . 

The sampling of ft at step 2.ii) cannot be carried out directly so we use a Metropolis- 
Hasting algorithm lMetropolis et al.1 (119531 ). 20,000 iterations are used to obtain 15,000 
simulated samples, discarding the first 5,000 "burn-in" samples. The sample mean of 
the above 15,000 posterior samples of /x, denoted as /ib, is the Bayesian estimate of 
the posterior mean E[/i\T n > c]. 



4.4 Bayesian Model Averaging (BMA) 



The Bayesian model averaging (BMA) is a coherent and co nceptually simple me thod 



devised to take into account the model uncertainty (see iHoeting et al 



1999, and 



references therein). For the problem discussed here, the uncertainty is related to our 
lack of information regarding the power of the test performed in the first stage. If we 
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knew say, that the power of the test is high, then we would be more confident that 
the signal detected is a true signal and this would be reflected in our choice of the 
prior. In the absence of such information, one could adopt the BMA methodology to 
increase the robustness of the Bayesian estimator. 

Assume that A is the quantity of inferential interest, for which a number of can- 
didate models, say Mi, ...,Mk, are available. Given the prior probability for each 
candidate model, p(Mi), 1 < % < K, the traditional BMA method assigns the poste- 
rior distribution given data D for A 

K 

p(A\D) = 5>(A|M fc , D)p(M k \D), (4.4) 
k=i 



where 



and 



p{D\M k )p{M k ) 
p{M k \D) = — 



Y:f =1 p{D\Mi)p{Mi) 



p{D\M k )= p(D\9 k ,M k )p(6 k \M k )d9 



k- 



In our setting, K=2 because only two models are considered. Let M\ be the 
model with prior = Beta(8, 0.5) and M 2 for p(£) = Beta(0.5,8). To specify the 
values for p(M{) and p(M 2 ), we utilize the threshold value c in the following fashion, 
p(Mi) = e ( -~ C//2 ' ) and p(M 2 ) = 1 — e ( -~ c ^ 2 \ Thus our prior belief in model M 1 (with 
higher density for false positive) decreases as the testing threshold value increases at 
an exponential rate. The posterior probabilities for the two models can be derived 
as: 

P (M-\T) = PCWMMQ 

n 41 n > p(T n \M 1 )p(M 1 )+p(T n \M 2 )p(M 2 y 

Thus, 

p(Mi\T n ) p(T„|M0 e(" c / 2 ) 



p{M 2 \T n ) p(T n \M 2 ) (l-e(- c / 2 ))' 
The direct computation, however, is difficult because the integral 

p(T n \M)= [ [ p(T n \^,(,M)p(^,M)p(C\M)diidC 



(4.5) 
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cannot be calculated in a closed form. Note that 



p(ji,£\T n ,M) 



p(T n \M,y,Z)p(»\Z,M)p(Z\M) 
p(T n \M) 



(4.6) 



thus p(T n \M) can be viewed as the normalizing constant of the posterior distribution 
p([/,,£\T n , M). Therefore, the first ratio in (14.51) is a ratio of two normalizing constants 
for two densities from which we can sample. The problem of estimating ratios of two 
nor malizing constants has bee n discussed by, among others, iMeng and Wong! (119961 ) 



and 



Gelman and 



Vlengi (119981 ). We use the bridge sampling method proposed by 



Meng and Wongl (119961 ) to compute the ratio in (14.51) . 
To compute (14.51) . let to — (fi, £) and 



and 



Pi =p(ft,Z\T n ,Mi] 



= p{T n \M t , fx, 0p(^, M t )p(Z\Mi, 



1,2. 



First, we simulate = 10,000 samples {{[in, £a)> {fam, £mj} fr° m eac h density 
Pi, i = 1,2. Then we compute Uj as follows: 



p{T n \M 2 , Liij, 0, )/'(/',., C,j. M 2 )p(£ij\M 2 ) 





Mi) 


p(dj 


M 2 ) 



-7.5 



If we denote the bridge sampling estimate f for ^"jj^j then, from equations (14.41) 
and (14.51) . we obtain the BMA estimator of /i 



re 



(-c/2) 



Hbma 



l - e (- c / 2 ) 



f e (-c/2) _|_ I _ e (-c/2)^ 1 f e (-c/2) + l_ e (-c/2) 



A*2, 



where /ti and /12 are the posterior means of \x obtained under model M\ and model 
M 2 , respectively. 
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Table 1: Sample size needed to obtain the desired power (rj) at the pre-specified type 
1 error rate (a) when fi = 0.0953 = log (1.1) and a = 1.685. 



a\?7 


0.1 


0.2 


0.5 


0.9 


0.99 


0.05 


41 


202 


846 


2678 


4932 


10" 4 


1857 


2588 


4323 


7816 


11423 


10~ e 


3767 


4783 


7062 


11383 


15666 



Remark Although highest posterior density (HPD) regions with posterior mass 
1 — a may be estimated using samples from the posterior under models Mi and M2, 
there is no direct way to construct a HPD for the model averaging estimator. In this 
paper, we will use only the model averaging estimates without a HPD associated to 
them. 

5 Simulation Study 

We performed a set of simulations to investigate the effect of sample size on the 
different estimators, under a complete factorial design. The factors are three levels of 
the type 1 error rate of the association test, a G {0.05, 10~ 4 , 10~ 6 } corresponding to 
threshold values c G {1.645, 3.719, 4.753}, and six levels of the power of the association 
test, r] = {0.1, 0.2, 0.5, 0.9, 0.99}. Throughout the simulation the true mean was fixed 
at fi = log(l.l) = 0.095 (i.e. OR of 1.1). and a was assumed to be known and set 
at 1.6855 so that the corresponding sample size n is reasonable for causal variant 
with low OR at each combination of a and 77 values (Tabled]). The value of 10~ 4 
and 10 -6 for a were chosen for association studies at the genome-wide level and 0.05 
for candidate gene type of studies. Power levels of 10%, 20% and 50% reflect the 
low power anticipated for the current genome-wide association studies, while a power 
level of 99% allows us to investigate the asymptotic behavior of the methods. 

Under each scenario, we began by simulating 200 significant data sets, Xi ~ 
N(fi,a 2 ),i = 1, ...,n, i.e., the value of the test statistics T n = -^j^ is greater than c. 
The following seven estimates of fi were computed for each generated data set: 

N: The naive estimate, X 
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MLE: The conditional MLE estimate based on equation (12.11) 

B.L: The Bayesian estimate when the prior for £ is Beta(8, 0.5), 

B.H: The Bayesian estimate when the prior for £ is Beta(0.5, 8), 

B.BMA: The BMA estimate obtained by averaging the B.L and B.H models, 

B.M: The Bayesian estimator when the prior for £ is Beta(2/3, 2/3), 

B.Unif: The Bayesian estimator when the prior for £ is Uniform(0, 1). 

If the MLE estimate based on equation (12.11) was negative, we set it equal to zero 
following the standard practice of interpreting the "flip-flop" phenomenon occurring 
at the same SNP in the same population. 

The estimation results are shown in Figures 2 and 3 for a = 0.05 and a = 10 -6 , 
respectively. The results confirm that, in the case of low power, the naive estimator 
has a large upward bias. Even in the moderately powered studies, the naive estimator 
considerably overestimates the true mean. 

Figure 2 shows that B.L is the best estimator when the power is low. How- 
ever, when the power increases B.BMA, B.M and B.Unif outperform the others. 
For more extreme significance levels, B.L tends to over-correct and B.H produces 
more accurate estimates as shown in Figure 3. It is clear that B.L and B.H are 
complementing each other so by considering the model average of the two we obtain 
B.BMA, a more robust estimator. Among the three Bayesian estimators, B.Unif 
yields similar estimation results to B.M. The natural implication is that putting 
equal prior weight on [0,1] is equivalent to putting equal weight on £ close to zero or 
close to 1. 

In most of the cases, the Bayesian estimators achieve the anticipated reduction in 
variance compared to the MLE estimator. This advantage over MLE is especially 
obvious in the low power studies. For example, when the power of test is 10%, and 
a = 10" 6 , the RMSE of B.BMA, B.M and B.Unif is 0.033, 0.056, 0.055 respectively, 
while the RMSE of MLE is 0.066. The bias of B.M and B.Unif are very comparable 
to those obtained using MLE. 
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Figure 2: Performance of the estimators for \l = Zog(l.l) = 0.0953 and a = 0.05. 
Top left: power=0.1, Top right: power=0.2, Bottom left: power=0.5, Bottom right: 
power=0.99. The horizontal line represents the true value of fi 
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Figure 3: Performance of the estimators for \i = log(l.l) = 0.0953 and a = 10 -6 . 
Top left: power=0.1, Top right: power=0.2, Bottom left: power=0.5, Bottom right: 
power=0.99. The horizontal line represents the true value of fi 
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6 Application Studies 



We applied our methods as discussed above to fou r datasets, one of which is I) the 
candidate gene association study of Lymphoma by IWang et al.l (120061 ) and three are 
genome- wide association studies of II) type 1 diabetes (T1D) by WTCCC (2007), III) 
psoriasis by Nair et al. (2009) and IV) complications of T1D by Dr. Andrew Paterson 



and his colleagues (personal communication). The Ly mphoma and 



Ghosh et al 



ID data-sets 



(120081 ) via the 



were chosen because they were previously analyzed by 
likelihood-based approach, and the other two studies were chosen because the OR 
estimates obtained from replication studies had been reported by the study authors. 
In addition, the T1D complication dataset allows us to show that the proposed method 
can be easily and robustly applied to quantitative trait studies. In each case, we 
present, in a table, the results of the different Bayesian estimators proposed and 
compare them to the original reported OR (i.e. the naive estimate), the MLE estimate 
and the estimate from available follow-up/replication study. The estimate from the 
replication study is an unbiased estimate of the true effect size, however, the value 
itself should not be viewed as the true parameter value because of the sampling 
variation and potential sub-population and sampling differences between original and 
follow-up studies. For all four examples we present the results obtained from using 
the MLE, B.L, B.H and B.BMA estimators. Although present side by side the 
HPD regions for each Bayesian method and the confidence intervals produced using 
the chi-square approximation for the conditional likelihood method, one should keep 
in mind that the statistical interpretation of these regions is different and therefore 
these are not comparable. 

Example I The Lymphoma study reported two significant SNPs (rsl800629 and 
rs909253) from a candidate gene study using a total of 48 SNPs and 318 cases and 
766 controls. The naive OR estimates are 1.54 and 1.40, respectively. We applied 
the MLE an d Baye sian methods using a p-value threshold of 0.1/48 ~ 0.002 as in 



Ghosh et al. 



(120081 ). The follow up estim ates are the resu 



ts fro m a larger pooled 



analysis involving seven studies reported in lRothman et al.l (120061 ). From the results, 
shown in Table [2j one can see that the Bayesian model averaging estimate is closer 
to the replicated value than the conditional MLE estimate. In this case the average 
model "favors" the B.H model which, in turn, yields estimates very close to the 
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replicated values. 

Example II The GWAS of T1D by WTCCC (2007) used about 2,000 cases and 
3,000 controls genotyped with the 500K Affymetrix chip, and it re ported six significant 
loci at the 5 x 10~ 7 level. We focus on the four SNPs analyz ed bvlGhosh et al 



(120081 ) 

because replication results are available from the study of iTodd et al.l (120071 ). The 
results in Table [3] show that the Bayesian methods considered yield similar results to 
the MLE. In this case, the prior influences the result only minimally, a fact that can 
be attributed to the strong signal in the data. 

Example III Nair et al. (2009) conducted a GWAS of Psoriasis using 438,670 
SNPs in 1,359 cases and 1,400 controls and a follow-up study on 21 promising SNPs. 
"Owing to the 'winner's curse', odds ratios estimated in the discovery sample were 
larger than those estimated in the follow-up samples" (Table 2 of Nair et al., 2009). 
The original selection of SNPs for follow-up study was based on p-value ranking 
and biological evidence. We used here a = 10~ 4 which captured all but one of the 
ten reported SNPs. The results (see Table HJ) show again that the Bayesian model 
averaging procedure leads to similar estimates as the conditional maximum likelihood. 
However, we should emphasize that the main advantage of the methods proposed here 
is the smaller variance in low-power studies, which in turn can produce more reliable 
sample size estimates for replication studies. The similarity between the estimates 
produced by our methods and the conditional likelihood approach is an added bonus 
since it shows that one does not trade bias for variance in this case. 

Example IV In the fourth setting of the on-going GWA study of longitudinal 
repeated quantitative measures of phenotype HbAlc in the Diabetes Control and 
Complications Trial (DCCT) samples, a significant locus (at a = 5 x 10~ 8 ) was iden- 
tified in the conventional treatment group with 667 samples near SORCS1 (rsl358030 
with p-value = 4.66 x 10~ 9 ). The association test is obtained via regression analysis 
of the average log(HbAlc) value vs. SNP with additive genotype coding. The naive 
estimate of the regression coefficient for rsl358030 is 0.045. However, the estimate 
obtained from the intensive treatment group with 637 samples was 0.005. Note that 
for the intensive treatment group, only the measures at the eligibility time-point (i.e. 
before the starting of the two different treatments) were used for the regression anal- 
ysis so that the two groups are comparable and the intensive treatment group could 
be used as a replication dataset. Both the MLE and the B.BMA fail to produce 
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estimates close to the replicated value, as seen in Table [5j However, the conservative 
Bayesian estimate produced by B.L is close to the replicated value. In practice, one 
would tend to use B.L or B.H based on a number of factors, some of which may be 
hard to quantify here. Certainly, a cautious researcher may lean more towards using 
the estimates produced by B.L. 

7 Conclusions and Future Work 

We propose hierarchical Bayes methods for bias reduction in statistical genetics anal- 
yses. The basis of the approach is a spike-and-slab prior which essentially allows for 
the possibility that the signal detected may be the product of chance. In addition, 
the prior permits the researchers to quantify their belief in the strength of the signal. 
Depending on the prior, inference based on the posterior distribution may be different 
from model to model. The researcher therefore faces the choice (sometimes difficult) 
between various models. We propose a Bayesian averaging method in which we use 
the data to weigh in on the more appropriate model. However, we should emphasize 
that the model averaging is not necessarily the best approach and, as in other choices 
one makes when dealing with genetics data (e.g., choice of false discovery level, num- 
ber of markers, sample size, etc) other factors may contribute to the decision of using 
a say, conservative model like B.L or ant i- conservative one like B.H. 

We have conducted additional simulation studies in which the effect of the true 
genetic effect size on the different estimators are investigated. For example, in a 
set of simulation, we fixed the sample size, n = 1,000 but vary \i so that fi = 
{0, log(l.l), /og(1.2), log(1.3), log(lA), Zog(1.5)}, and we still assumed a = 1.6855 and 
a = {0.05, 10~ 4 , 10~ 6 }. The results changed quantitatively buy the main conclusion 
remained same as the one illustrated by the set of simulation studies above. 

To apply the proposed Bayesian methods to the DCCT dataset with a quantitative 
trait phenotype, we first used the same Uniform(0,2) density for f(n) as for LOR. 
This allowed us to test the robustness of the method since the upper bound for \i 
in the regression setting can be reasonable assumed to be 0.2. To be more precise, 
note that /i is a regression coefficient in this setup and is related to the percentage of 
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phenotype variation explained by the SNP via the expression, 



o2 

2 2°X 

r = V 

where S\ = 0.4674 is the sample variance of the SNP (coded as 0, 1 and 2) and 
Sy = 0.136914 2 = 0.0187 is the sample variance of the phenotype (i.e. log(Alc) 
value). Since r 2 < 100%, thus \i < 0.2. When Uniform(0, 0.2) was used, the estimates 
were largely unchanged compared to results in Table [5] and are 0.00187 (0,0.017), 
0.027(0,0.051) and 0.0252 for B.L (hpdl), B.H (hpdl) and B.BMA respectively. This 
yet again demonstrate the robustness of the proposed Bayesian methods. 

The likelihood-based methods and the proposed Bayesian approach both correct 
for threshold effect, i.e. the SNP of interest must pass significance threshold. In prac- 
tice, another source of bias is the maximization effect. More precisely, suppose that a 
number of independent SNP are considered but only the maximum effect is estimated. 
Again, the effect estimate is biased but a likelihood-based correction is cumbersome 
since the effects included in the maximization may have different distributions and 
may depend on unknown parameters. The proposed Bayesian method only indirectly 
models the maximum effect by allowing the SNP of interest to be false positive. So 



far, the method of choice 
method of 



Sun and Bulll 



or t 



lis problem remains the bootstrap-based correction 
(j2005l ). The root of the estimation bias discussed here is 
due to the sequential analysis strategy of first selecting significant variants then es- 
timating their effects. We are currently working on methods for joint modeling of 
testing and estimation as an alternative solution to the problem discussed here. 
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Table 2: Candidate gene association study of Lymphoma by Wang et al. (2006), 318/766 cases/ controls, a = 0.002 . 





study SNP 


P-value 


Reported 


MLE(CI) 


B.L(hpdl) 


B.H(hpdl) 


B.BMA(hpdl) 


Follow up 


to 

Cn 






OR 












rsl800629 


5.7 x 10" 4 


1.54 


1.14(1,1.84) 


1.02(1,1.06) 


1.26(1,1.72) 


1.21 


1.29 




rs909253 


7.4 x 10" 4 


1.4 


1.03(1,1.59) 


1.01(1,1.03) 


1.19(1,1.51) 


1.14 


1.16 



Table 3: GWAS of T1D by WTCCC (2007), 2000/3000 cases /controls, a = 5 x 1(T 7 . 



study SNP 


WTCCC 
P-value 


WTCCC(CI) 


MLE(CI) 


B.L(hpdl) 


B.H(hpdl) 


B.BMA(hpdl) 


Follow up 


rsl7696736 
rs2292239 
rsl2708716 
rs2542151 


7.27 x 1(T 14 
1.49 x 10-9 

1.28 x 10-8 
8.4 x 10" 8 


1.37(1.27,1.49) 
1.3(1.20,1.42) 
0.77(0.70,0.84) 
1.33(1.20,1.49) 


1.37(1.25,1.49) 
1.27(1.08,1.41) 
0.81(0.71,1) 
1.15(1,1.43) 


1.36(1.24,1.49) 
1.03(1,1.28) 
0.99(0.88,1) 
1.01(1,1.04) 


1.36(1.24,1.49) 
1.24(1.07,1.41) 
0.84(0.73,1) 
1.16(1,1.37) 


1.36 
1.24 
0.85 
1.15 


1.16(1.09,1.23) 
1.28(1.20,1.36) 
0.83(0.78,0.89) 
1.29(1.19,1.40) 



Table 4: GWAS of Psoriasis by Nair et al. (2009), 1,409/1,436 cases/controls, a = 1(T 4 



study SNP 


P-value 


Reported 
OR 


MLE(CI) 


B.L(hpdl) 


B.H(hpdl) 


B.BMA(hpdl) 


Follow up 


rsl2191877 


4 x 10" 53 


2.79 


2.79(2.56,3.04) 


2.79(2.45, 3.18) 


2.79(2.46,3.21) 


2.79 


2.64 


rs2082412 


5 x 10~ 10 


1.56 


1.56(1.41,1.71) 


1.5(1,1.75) 


1.55(1.32,1.8) 


1.55 


1.44 


rsl7728338 


2 x 10~ 7 


1.72 


1.67(1.35,1.97) 


1.08(1,1.75) 


1.6(1.18,2.09) 


1.57 


1.59 


rs20541 


6 x 10-6 


1.37 


1.26(1,1.50) 


1.01(1,1.11) 


1.22(1,1.47) 


1.19 


1.27 


rs610604 


1 x 10-5 


1.28 


1.18(1,1.38) 


1.01(1,1.06) 


1.16(1,1.35) 


1.14 


1.19 


rs2066808 


2 x 10~ 5 


1.68 


1.26(1,1.98) 


1.02(1,1.09) 


1.32(1,1.84) 


1.27 


1.34 


rs2201841 


3 x 10~ 7 


1.35 


1.32(1.11,1.46) 


1.04(1,1.34) 


1.29(1.08,1.51) 


1.28 


1.13 


rsl076160 


2 x 10-5 


1.26 


1.11(1,1.36) 


1.01(1,1.03) 


1.13(1,1.31) 


1.11 


1.09 


rsl2983316 


2 x 10-5 


1.36 


1.15(1,1.50) 


1.01(1,1.08) 


1.18(1,1.44) 


1.15 


1.09 



Table 5: GWAS of HbAlc (complication of T1D) by Paterson et al. (personal communication), 637 population samples with 
T1D, a = 5x 1(T 8 



study SNP 


P-value 


Reported 
OR 


MLE(CI) 


B.L(hpdl) 


B.H(hpdl) 


B.BMA(hpdl) 


Follow up 


rsl358030 


4.66 x 10~ 9 


0.045 


0.0323(0,0.0552) 


0.00194(0,0.01871) 


0.0269(0,0.0507) 


0.0254(0,0.0477) 


0.005 



